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Abstract. Phylogenetic invariants are certain polynomials in the joint probability 
distribution of a Markov model on a phylogenetic tree. Such polynomials are of theoreti- 
cal interest in the field of algebraic statistics and they are also of practical interest — they 
can be used to construct phylogenetic trees. This paper is a self-contained introduction 
to the algebraic, statistical, and computational challenges involved in the practical use 
of phylogenetic invariants. We survey the relevant literature and provide some partial 
answers and many open problems. 
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1. Introduction. The emerging field of algebraic statistics (cf. |42j ) 
has at its core the belief that many statistical problems are inherently al- 
gebraic. Statistical problems are often analyzed by specifying a model — a 
family of possible probability distributions to explain the data. In particu- 
lar, many statistical models are defined parametrically by polynomials and 
thus involve algebraic varieties. From this point of view, one would hope 
that the ideal of polynomials that vanish on a statistical model would give 
statistical information about the model. This is not a new idea in statis- 
tics, indeed, tests based on polynomials that vanish on a model include the 
odds-ratio, which is based on the determinant of a two by two matrix. The 
polynomials that vanish on the statistical model have come to be known 
as the ( algebraic ) invariants of the model. 

The field of phylogenetics provides important statistical and biological 
models with interesting combinatorial structure. The central problem in 
phylogenetics is to determine the evolutionary relationships among a set of 
taxa (short for taxonomic units, which could be species, for example). To 
a first approximation, these relationships can be represented using rooted 
binary trees, where the leaves correspond to the observed taxa and the 
interior nodes to ancestors. For example, Figure [T] shows the relationships 
between a portion of a gene in seven mammalian species. 

Phylogenetic invariants are polynomials in the joint probability distri- 
bution describing sequence data that vanish on distributions arising from 
a particular tree and model of sequence evolution. The first of the invari- 
ants for phylogenetic tree models were discovered by Lake and Cavender- 
Felsenstein [36l [14]. This set off a fiurry of work: in mathematics, gener- 
alizing these invariants (cf. [271 [211 [52] ) in phylogenetics, using these 
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invariants to construct trees (cf. [551 [351 [321 Hll SS] ) . However, the linear 
invariants didn't fare well in simulations [5D] and the idea fell into disuse. 

However, the study of phylogenetic invariants was revived in the field 
of algebraic statistics; the subsequent theoretical (cf. [21 \5U\ [I2l [5]) and 
practical (cf. [13l [HI [181 [201 [34] ) developments have given cause for opti- 
mism in using invariants to construct phylogenetic trees. There are benefits 
to these algebraic tools; however, obstacles in algebraic geometry, statis- 
tics, and computer science must be overcome if they are to live up to their 
potential. In this paper, we formulate and analyze some of the fundamental 
advantages and difficulties in using algebraic statistics to construct phylo- 
genetic trees, describing the current research and formulating many open 
problems. 

In geometric terms, the problem of phylogenetic tree construction can 
be stated as follows. We observe DNA sequences from n different taxa 
and wish to determine which binary tree with n leaves best describes the 
relationships between these sequences for a fixed model of evolution. Each 
of these trees corresponds to a different algebraic variety in . The DNA 
sequences correspond to a certain point in as well. Picking the best 
tree means picking the variety that is closest to the data point in some 
sense. Since the data will not typically lie on the variety of any tree, we 
have to decide what is meant by "close" . 

Denote the variety (resp. ideal) associated to a tree T by V{T) (resp. 
/(T)). Our main goal, then, is to understand how the polynomials in I{T) 
can be used to select the best tree given the data. In order to answer this 
question, there are five fundamental obstacles. 

1. Formulate an appropriate model of evolution and determine the 
invariants for that model, if possible in a form that can be evaluated 
quickly. 

2. Choose a finite set of polynomials in I{T) with good discriminating 
power between different trees. 

3. Given a set of invariants for each tree, define a single score that 
can be used to compare different trees. 

4. Since the varieties are in R^ , each polynomial is in exponentially 
many unknowns. Thus even evaluating a single invariant could 
become difficult as n increases. This is in addition to the problem 
that the number of trees and the codimension of V{T) increase 
exponentially. Phylogenetic algorithms arc often used for hundreds 
of species. Can invariants become practical for large problems? 

5. Statistical models are not complex algebraic varieties; they make 
sense only in the probability simplex and thus are real, semi- 
algebraic sets. This problem is more than theoretical — it is quite 
noticeable in simulated data (see Figures [6l and [T]). Can semi- 
algebraic information be used to augment the invariants? 

In the remainder of the paper, we will analyze these problems in detail, 
showing why they are significant and explaining some methods for dealing 
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with them. The first problem (determining phylogenetic invariants) has 
been the focus of substantial research, thus we deal here with only the last 
four problems. We begin by introducing phylogenetics and constructing 
and using some phylogenetic invariants, then consider the four problems in 
order. 

While in this paper we concentrate solely on the problem of construct- 
ing phylogenetic trees using invariants, we should note that phylogenetic 
invariants are interesting for many other reasons. On the theoretical side 
of phylogenetics, they have been used to answer questions about identi- 
fiability (e.g., The study of the algebraic geometry arising from 

invariants has led to many interesting problems in mathematics [181 IHl US] . 

2. Background. We give here a short, self-contained introduction to 
phylogenetics and phylogenetic invariants. For a more thorough survey of 
algebraic methods in phylogenetics, see |4j. Also see [231 SS] for more of 
the practical and combinatorial aspects of phylogenetics. 

Definition 2.1. Let X be a set of taxa. A phylogenetic tree T on X 
is a unrooted binary tree with \X\ leaves where each leaf is labelled with 
an element of X and each edge e of T has a weight, written t^ and called 
the branch length. 

While we include branch lengths in our definition of phylogenetic trees, 
our discussions about constructing trees are about only choosing the cor- 
rect topology (meaning the topology of the labelled tree), not the branch 
lengths. While estimating branch lengths is relatively easy using maximum 
likelihood methods after a tree topology is fixed (e.g., with [54]), it is an in- 
teresting question whether algebraic ideas can be used to estimate branch 
lengths (see [48] for algebraic techniques for estimating parameters in 
invariable-site phylogenetic models). 

Phylogenetics depends on having identified homologous characters be- 
tween the set of taxa. For example, historically, these characters might be 
physical characteristics of the organisms (for example, binary characters 
might include the following: are they unicellular or multicellular, cold- 
blooded or hot-blooded, egg-laying or placental mammals). In the era of 
genomics, the characters are typically single nucleotides or amino acids that 
have been inferred to be homologous (e.g., the first amino acid in a certain 
gene that is shared in a slightly different form among many organisms). 
For example, see Figure[2l which shows a multiple sequence alignment. We 
will throughout make the typical assumption that characters evolve inde- 
pendently, so that each column in Figure [2] is an independent, identically 
distributed (i.i.d.) sample from the model of evolution. While both DNA 
and amino acid data are common, we will work only with DNA and thus 
use the alphabet S = {A, C, G, T}. 

We assume that evolution happens via a continuous-time Markov pro- 
cess on a phylogenetic tree (see [41] for general details about Markov 
chains). That is, along each edge e there is a length te and a rate ma- 
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Fig. 1. Phylogenetic tree for seven mammalian species derived from an align- 
ment of a portion of the HOXA region (ENCODE region ENmOlO, see \17^ and 
genome.ucsc.edu/encode). This tree was built using the dnaml maximum likelihood 
package from PHYLIP !24J on an alignment partially shown in Figure [3 
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Fig. 2. Multiple sequence alignment of length 180 from the HOXA region of seven 
mammalian genomes. Dashes indicate gaps; bases are colored according to their simi- 
larity across the species. 



trix Qe giving the instantaneous rates for evolution along edge e. Then 
Me = e^"*"' is the transition matrix giving the probabilities of substitutions 
along the edge. In order to work with unrooted trees, we will assume that 
the Markov process is reversible, that is, HiMf,{i,j) — TTjMe{j,i), where tt 
is the stationary distribution of Me- In order for 6*3=*= to be stochastic, we 
must have Q{i,i) < 0, Q{i,j) > for i ^ j, and J2j Qihj) — for all i. 
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This form of rate matrix is known as the Jukes-Cantor model [33j . For 
example, the probability of changing from an A to a C along edge e is given 

byMe(l,2) = i^^. 

Commonly used models that are more realistic than the Jukes-Cantor 
model include the Kimura 3-parameter model [35j where the rate matrices 
are of the form 

/• 7 a /3\ 

7 • /? a 
a /? • 7 
V/3 a 7 7 

where • = —7 — a — p. See [42} Figure 4.7] for a description of many other 
possible models. 

In order to obtain the joint distribution of characters at the leaves of 
the trees, we have to choose a root of the tree (arbitrarily, since the pro- 
cesses are time reversible), and run the Markov process down the edges of 
the tree, starting from a distribution of the characters at the root. The 
result is a joint probability distribution p = (pa...a, ■ • ■ ,Pt...j), and the im- 
portant point is that the coordinates of p can be written as polynomials in 
the transition probabilities. That is, the model is specified parametrically 
by polynomials in the entries of Me . We will forget about the specific form 
of the entries of Me = e*^=*= and instead treat each entry of Me as an 
unknown. Thus for the Jukes-Cantor model, we have two unknowns per 



edge: ag 



l+3e 



and Pe 



This makes the algebraic model 



4 """"^ h^e — 4 

more general than the statistical model (as it allows probabilities in the 
transition matrices to be negative or even complex). Although this allows 
algebraic tools to be used, we will see in Section [7] that it can be a disad- 
vantage. Generally speaking, there are two types of phylogenetic models 
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that have been thoroughly studied from the algebraic viewpoint: "group 
based" models such as the Jukes-Cantor and Kimura models, and variants 
of the general Markov model, in which no constraints are placed on the 
transition matrices. 

In this paper, we define the phylogenetic invariants for a model of evo- 
lution and a tree to be the polynomials in the joint probabilities that vanish 
if the probabilities come from the model on the tree. For example, for a 
quartet tree (an unrooted binary tree with four leaves, see Figure [3]) under 
the Jukes-Cantor model, Paaaa — Pcccc = 0, due to the symmetry built into 
the model. However, this polynomial doesn't differentiate between trees — 
it lies in the intersection of the ideals of the three quartet trees. Beware 
that there are two commonly used definitions of phylogenetic invariants. 
Originally, they were defined as polynomials that vanish on probability dis- 
tributions arising from exactly one tree, so the above polynomial would be 
excluded. However, it is more algebraically convenient to take as invariants 
the full set of polynomials that vanish, as this forms an ideal. We spend 
the rest of this section deriving a particularly important polynomial. 

A class of phylogenetic methods bypass working with the joint prob- 
ability distribution and instead only estimate the distances between each 
pair of taxa. The goal then is to find a tree with branch lengths such that 
the distance along edges of the tree between pairs of leaves approximates 
the estimated pairwise distances. To use these distance methods, we first 
need a couple of definitions. We will concentrate in this paper on quartet 
trees, i.e., trees with four leaves. There are 3 different (unrooted, binary) 
trees on four leaves, we will write them (01 : 23), (02 : 13), and (03 : 12), 
corresponding to which pairs of leaves are joined together. 

Definition 2.2. A dissimilarity map d e R^^^ satisfies d(i,j) = 
rf(jj *) > and d{i, i) — 0. We say that d is a tree metric if there exists a 
phylogenetic tree T with non-negative branch lengths t^ such that for every 
pair i,j of taxa, d{i,j) is the sum of the branch lengths t^ on the edges of 
T connecting i and j . 

Proposition 2.1 (Four-point condition [10,). A dissimilarity map d 
is a tree metric if and only if for every i,j, k, and I, the maximum of the 
three numbers 

dij + dku dik + dju and da + djk 

is attained at least twice. 

Example 2. Let us restrict our attention to a tree with four leaves, 
{ij : kl). In this case, the four-point condition becomes (see Figure [3]) 

dij + dki < dtk + dji = d^i + djk- (2.2) 

The equality in the four-point condition can be translated into a quadratic 
polynomial in the probabilities, however, we first have to understand how to 
transform the joint probabilities into distances. Distances can be estimated 
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Fig. 3. The four-point condition. 



from data in a variety of ways (there are different formulas for the maximum 
Ukelihood estimates of the distances under different models of evolution, see 
PSI Chapter 13]). The formula for the general Markov model is the logdet 
distance, which mimics what we saw above (j2.ip . in that a transition matrix 
is estimated and the distance is taken to be the log of the determinant of 
this matrix. 

Here we will use a simpler formula for the distance, under the Jukes- 
Cantor model (Example [T|) . The maximum likelihood estimate of the dis- 
tance between two sequences under the Jukes-Cantor model is given by 
dij = — I log ^1 — il^^ where is the fraction of mismatches between 
the two sequences, e.g., 

mi2 = ^ Pijki 

After plugging this distance into the four point condition, cancelling, 
and exponentiating, the equality in (|2.2p becomes 



1 - ^m-ik^ ^1 - ^"^J' j ^ ^ ^™'') ^ ^"^ly = 0. (2.3) 

We will call this polynomial the four-point invariant. This construction is 
originally due to Cavender and Felsenstein 14J. 

Example [2] shows one of the first constructions on a phylogenetic in- 
variant, in the same year as the discovery by Lake of linear invariants 
|36j . There is a linear change of coordinates on the probability distribution 
p under which I{T) has a generating set of binomials. In particular, in 
these coordinates, a simple calculation shows that (|2.3p becomes a bino- 
mial. Known as the Hadamard or Fourier transform [27l |52j [211 [50] , this 
change of coordinates transforms the ideals of invariants for several models 
of evolution into toric ideals [49]. It should be emphasized, however, that 
this transform is only known to exist for group-based models. 

The four-point invariant is a polynomial in the joint probabilities that 
vanishes on distributions arising from a certain quartet tree. Define the 
ideal Im {T) of invariants for a model M of evolution on a tree T to be the 
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set of all polynomials that are identically zero on all probability distribu- 
tions arising from the model M on T. We will write only /(T) when is 
clear from context. 

3. How to use invariants. The basic idea of using phylogenetic 
invariants is as follows. A multiple sequence alignment DNA alignment of 
n species gives rise to an empirical probability distribution p G . This 
occurs simply by counting columns of each possible type in the alignment, 
throwing out all columns that contain a gap (a "-" symbol). For example, 
Figure [5] has exactly one column that reads "CCCACCC" (the first) out of 
107 gap-free columns total, so pcccAccc = 1/107. 

Then if / is an invariant for tree T under a certain model of evolution, 
we expect /(p) ~ if (and generically only if) the alignment comes from 
the model on T. More precisely, where pat is the empirical distribution after 
seeing N observations from the model on T, then limjv^oo E{f{pN)) = 0. 

We thus have a rough outline of how to use phylogenetic invariants to 
construct trees: 

1. Choose a model Ai of evolution. 

2. Choose a set of invariants fr for model A4 for each tree T with n 
leaves. 

3. Evaluate each set of invariants at p. 

4. Pick the tree T such that frip) is smallest (in some sense). 
However, all of these steps contain difficulties: there are infinitely many 

polynomials to pick in exponentially many unknowns and exponentially 
many trees to compare. We will discuss step 2 in Section |4l step 3 in 
Section[6l and step 4 in Section[5l Selecting a model of evolution is difficult 
as well. There is, as always, a trade-off between biological realism (which 
could lead to hundreds of parameters per edge) and statistical usefulness 
of the model. 

Since the rest of this paper will discuss difficulties with using invari- 
ants, we should stop and emphasize two especially promising features of 
invariants: 

1. Invariants allow for arbitrary rate matrices. One major challenge 
of phylogenetics is that evolution does not always happen at one rate. 
But common methods for constructing trees generally assume a single rate 
matrix Q for all edges, leading to difficulties on data with heterogeneous 
rates. While methods have been developed to solve this problem (cf. [551 
nil [26]), it is a major focus of research. 

In contrast, phylogenetic invariants allow for differing rate matrices 
within the chosen model on every edge (and in fact, even changing rate 
matrices along a single edge). The invariants for the Kimura 3-parameter 
model [35j have been shown to outperform neighbor-joining and maximum 
likelihood on quartet trees for heterogeneous simulated data [11_ . To be fair, 
we should note that the invariants in this analysis were based on the correct 
model (i.e., the Kimura 3-parameter with heterogeneous rates, which the 



USING INVARIANTS FOR PHYLOGENETIC TREE CONSTRUCTION 9 

data was simulated from) while the maximum likelihood analysis used an 
incorrect model (with homogeneous rates) due to limitations in standard 
maximum likelihood packages. 

2. Invariants perhaps can test individual features of trees. Researchers 
are frequently interested in the validity of a single edge in the tree. For 
example, we might wonder if human or dog is a closer relative to the rabbit. 
This amounts to wondering about how much confidence there is in the 
edge between the human-rabbit-mouse-rat subtree and the dog subtree in 
Figure[T] There are methods, most notably the bootstrap I22J and Bayesian 
methods (cf. 32J), which provide answers to this question, but there are 
concerns about their interpretation [551 ttEl SOI H] • 

As for phylogenetic invariants, the generators of the ideal I{T) are, in 
many cases, built from polynomials constructed from local features of the 
tree. Thus invariants seem to be well suited to test individual features of 
a tree. For example, suppose we have n taxa. Consider a partition {A, B} 
of the taxa into two sets. Construct the x matrix F\a,tA,B{p) 

where the rows are indexed by assignments of S to the taxa in A and the 
columns by assignments of E to the taxa in B. The entry of the matrix in 
a given row and column is the joint probability of seeing the corresponding 
assignments of S to A and B. The following theorem is [6] Theorem 4] and 
deals with the general Markov model, where there are no conditions on the 
form of the rate matrices. 

Theorem 3.1 (AUman-Rhodes) . Let E = {0, 1} and let T be a binary 
tree under the general Markov model. Then the 3x3 minors ofFlaiA.Bip) 
generate I{T) for the general Markov model, where we let A, B range over 
all partitions of [n] that are induced by removing an edge of T . 

While the polynomials in Theorem 13.11 do not generate the ideal for 
the DNA alphabet, versions of these polynomials do vanish for any Markov 
model on a tree. A similar result also holds for the Jukes-Cantor model in 
Fourier coordinates; the following is part of ^W, Thm 2]. 

Theorem 3.2 (Sturmfels-SuUivant) . The ideal for the Jukes-Cantor 
DNA model is generated by polynomials of degree 1, 2, and 3 where the 
quadratic (resp. cubic) invariants are constructed in an explicit combina- 
torial manner from the edges ( resp. vertices ) of the tree. 

Although there are many challenges to overcome, the fact that phylo- 
genetic invariants are associated to specific features of a tree provides hope 
that they can lead to a new class of statistical tests for individual features 
on phylogenetic trees. 

4. Choosing powerful invariants. There are, of course, infinitely 
many polynomials in each ideal /(T), and it is not clear mathematically or 
statistically which should be used in the set fr of invariants that we test. 
For example, we might hope to use a generating set, or a Grobner basis, 
or a set that locally defines the variety, or a set that cuts out the variety 
over R. We have no actual answers to this dilemma, but we provide a few 
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Fig. 4. Distributions of three invariants (the four-point invariant, Lake's linear 
invariant, and a biased invariant) on simulated data. The white histogram corresponds 
to the correct tree, the black and gray are the other two trees. The invariants have quite 
different variances and performance. 



illustrative examples and suggest possible criteria for an invariant to be 
powerful. We will deal with the Jukes-Cantor model on a tree with four 
leaves; the 33 generators for this ideal can be found on the "small trees" 
website www . shsu . edu/'-^idg065/sinall-trees/| [T5] . 

We believe that symmetry is an important factor in choosing powerful 
invariants. The trees with four leaves have a very large symmetry group: 
each tree can be written in the plane in eight different ways (for example, 
one tree can be written as (01 : 23), (10 : 23), . . . , (32 : 10)), and each 
of these induces a different order on the probability coordinates piju- This 
symmetry group (Z2 x Z2 x Z2) acts on the ideal I{T) as well. In order 
that the results do not change under different orderings of the input, we 
should choose a set fp of invariants that is closed (up to sign) under this 
action. After applying this action to the 33 generators, we get a set of 49 
invariants. This symmetry will also play an important role in our metric 
learning algorithms in Section [5l See also [51 for a different perspective 
on symmetry in phylogenetics. 

We begin by showing how different polynomials have drastically differ- 
ent behavior. Figure 3] shows the distribution of three of the invariants on 
data from simulations of alignments of length 1000 from the Jukes-Cantor 
model on (01 : 23) for branch lengths ranging from 0.01 to 0.75 (similar to 
[3D1 [m HO] ) ■ The histograms show the distributions for the simulated tree 
in white and the distributions for the other trees in gray and black. The 
four-point invariant (left) distinguishes nicely between the three trees with 
the correct tree tightly distributed around zero. It is correct almost all of 
the time. Lake's linear invariant (middle) also shows power to distinguish 
between all three trees, but distributions overlap much more — it is only 
correct about half of the time. The final polynomial seems to be biased 
towards selecting the wrong tree, even though it does not lie in I[T) for 
either of the other two trees. 

Figure shows the performance of all the generators for this ideal on 
simulated data. The four-point invariant is the best, but the performance 
drops sharply with the other generators. Notably, the four-point invariant 
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Fig. 5. Prediction rate for the 49 Jukes-Cantor invariants on simulated data of 
length 100. The four-point invariant is by far the best, although four other invariants 
are quite good. 

and several of the other powerful ones are unchanged (aside from sign) 
under the symmetries of the tree. While any invariant can be made sym- 
metric by averaging, this behavior leads us to believe that invariants with 
a simple, symmetric form may be the best choice. 

For more complex models, it becomes even more necessary to pick a 
good set of invariants since there are prohibitively many generators of the 
ideal. The paper [12] describes an algebraic method for picking a subset 
of invariants for the Kimura 3-parameter model, which has 11612 genera- 
tors for the quartet tree (after augmenting by symmetry). Their method 
constructs a set of invariants which is a local complete intersection, and 
shows that this defines the variety on the biological relevant region. This 
reduces the list to 48 invariants which overall behave better than all 11612 
invariants. However, of these 48, only 4 rank among the top 52 invariants 
in prediction rate (using simulations similar to those that produced Fig- 
ure [5]) and the remaining 44 invariants are mostly quite poor (42% average 
accuracy). This result, while of considerable theoretical interest, doesn't 
seem to give an optimal set of invariants. 

5. Comparing trees. Once we have chosen a set of invariants for 
each tree T, we want to pick the tree such that {t{p) is smallest (in some 
sense). The examples in Section 3] show why this is a non-trivial problem — 
different invariants have different power and different variance and thus 
should be weighted differently in choosing a norm on fp. In this section, 
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we briefly describe an approach to normalizing the invariants to enable us 
to choose a tree. It is based on machine learning and was developed in pUj . 
It leads to large improvements over previous uses of invariants; however, it 
is computationally expensive and dependant on the training data. It can 
be thought of as finding the best single invariant which is a quadratic form 
in the starting set fp of invariants. 

There are also standard asymptotic statistical tools such as the delta 
method for normalizing invariants to have a common mean and variance. 
They have the disadvantage of depending on a linear approximation and 
asymptotic behavior, which might not be accurate for small datasets. For- 
tunately, the varieties for many phylogenetic models are smooth in the 
biologically significant region [12] , so linear approximations may work well. 

This problem is somewhat easier when we are choosing between dif- 
ferent trees with the same (unlabelled) topology, for example, the three 
quartet trees. In this case the different ideals I{T) are the same under a 
permutation of the unknowns, and thus we are comparing the same sets of 
polynomials (as long as the chosen set fr is closed under the symmetries 
of T) . For this reason, we will concentrate on the case of quartet trees and 
write Ti = (01 : 23), T2 = (02 : 13), and Tg = (03 : 12). 

Let p{9) be an empirical probability distribution generated from a 
phylogenetic model on tree Ti with parameters 9. Choose m invariants ii 
{i = 1, 2, 3) that are closed under the symmetries of Ti. We want a norm 
II II* such that 

\\h{p{0m*<^^i\\Hm)\UA\Hpm\u) (5.1) 

is typically true, i.e., the true tree should have its associated invariants 
closer to zero than others on the relevant range of parameter space. 

In order to scale and weigh the individual invariants, the algorithm 
seeks to find an optimal || ||* within the class of Mahalanobis norms. Recall 
that given a positive (semi) definite matrix A, the Mahalanobis (semi)norm 
\\ ■ \\a is defined by 

\\x\\a = Vx*~Ax. 

Since A is positive semidefinite, it can be written as A = UDU* where 
U is orthogonal and D is diagonal with non- negative entries. Thus the 
positive semidefinite square root B = U\/DU^ is unique. Now since ||a;||^ = 
x*" Ax — {BxY{Bx) = ||i?a;|p, learning such a metric is the same as finding 
a transformation of the space of invariants that replaces each point x with 
Bx under the Euclidean norm, i.e., a rotation and shrinking/stretching of 
the original x. 

Now suppose that is a finite set of parameters from which training 
data fi{p{6)),i2{p{0j),i3{p{9)) is generated for 9 £ &. As we saw above, 
each of the eight possible ways of writing each tree induces a permuta- 
tion of the coordinates Pijki and thus induces a signed permutation of 
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the coordinates of each {i{p{6)). Write these permutations in matrix form 
as TTi, . . . ,7rg. Then the positive semidefinite matrix A must satisfy the 
symmetry constraints iTiA = Ani which are hyperplanes intersecting the 
semidefinite cone. This symmetry condition is crucial in reducing the com- 
putational cost. Given training data, the following optimization problem 
finds a good metric on the space of invariants. 

Minimize: Leee ^(^) + ^^^^ 

Subject to: ||fi(p(0))||i + 7 < ||f.(pW)|li + (for z = 2,3), 

TT,A = A'K, (forl<i<8), 
^{9) > 0, and 
AhO, 

(5.2) 

where A ^ denotes that A is a positive semidefinite matrix, so this is a 
semidefinite programming problem. There are several parameters involved 
in this algorithm: (,{9) ioi 9 E Q are slack- variables measuring the violation 
of (|5.ip , 7 is a margin parameter that lets us strengthen condition (|5.ip , and 
A is a regularization parameter to keep the trace tvA small while keeping 
A as low rank as possible. It tries to find a positive semidefinite A at a 
trade-off between the small violation of (|5.ip and small trace A. 

The metric learning problem (j5.2p was inspired by some early results 
on metric learning algorithms [531 147] , which aim to find a Mahalanobis 
(semi)norm such that the mutual distances between similar examples are 
minimized while the distances across dissimilar examples or classes are kept 
large. If it becomes too computationally expensive, we can restrict A to be 
diagonal, which reduces the problem to a linear program. See [20' for details 
and simulation results. The learned metrics significantly improve on any of 
the individual invariants as well as on unweighted norms. The semidefinite 
programming algorithm is computationally feasible for approximately 100 
invariants, and the choice of powerful invariants is important. 

6. Efficient computation. At first glance, the problem of using in- 
variants seems intractable for large trees for the simple reason that the 
number of unknowns grows exponentially with the number of leaves. How- 
ever, the problem is not as bad as it may seem. Phylogcnetic analyses typ- 
ically use DNA sequences at most thousands of bases long, which means 
that the empirical distribution p G will be extremely sparse even with 
a relatively small number of taxa. 

Also the data can be sparse, this will not help unless we can write 
down the invariants in sparse form. If the polynomials can be written 
down in an effective way, they can be evaluated quickly. The determinantal 
form of the invariants in Theorem 13.11 provide such a form; see |18j for an 
algorithm to construct phylogenetic trees in polynomial time using these 
invariants and numerical linear algebra. Also see [5] for invariants that 
are (in some sense) determinantal. It seems that determinantal conditions 
could be particularly useful, so we suggest Problem 18.51 to computational 
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commutative algebraists (see also 19 ). 

Unfortunately, for group-based models the polynomials are only sparse 
when written in Fourier coordinates, and the Fourier transform takes a 
sparse distribution p and produces a completely dense vector q. Many 
of the invariants are determinantal in Fourier coordinates, but since the 
matrices are dense, they are difficult to write down. Can these polynomials 
be evaluated efficiently? 

7. Positivity. Recall that the four point condition (Proposition 12.11 
and Figure [3]) says that for a dissimilarity map d arising from the quartet 
tree (01 : 23), 

doi + < do2 + di-i = do3 + di2. (7.1) 

This is true since the right two sums traverse the inner edge of the tree 
twice (Figure [3|). We saw in Example [2] that the equality in (|7.ip translates 
to a quadratic invariant. However, notice that if the interior branch of the 
tree has negative length, the equality is still satisfied, but the inequality 
changes so that dgi + ^23 is now larger than the other two sums. 

The widely used neighbor-joining algorithm [43 , when restricted to 
four taxa, reduces to finding the smallest of the three sums in the four-point 
condition. That is, neighbor-joining on a quartet tree involves estimating 
the distances as in Section [2] and then returning the tree {ij : kl) that 
minimizes -I- d^i ■ If instead we used the quadratic invariant arising from 
the equality in the four point condition, we would have an invariant based 
method that simply returns the tree {ij : kl) that minimizes Idik+dji — du — 
djk\- We saw in Section [?] that this invariant performs quite well compared 
to the other generators of the Jukes-Cantor model. However, it compares 
poorly to the neighbor-joining criterion in the following way. 

Figure [S] shows the difference between these two selection criteria on 
a projection of the six dimensional space of dissimilarity maps R^^) to two 
dimensions. The three black lines are the projections of distances arising 
from the three different trees. Moving out from the center along these lines 
corresponds to increasing the length of the inner edge in the tree. 

Geometrically, neighbor-joining can be thought of as finding the closest 
tree metric (a point on a black half-ray) to a dissimilarity map. The four- 
point invariant can't distinguish negative inner branch length (the dotted 
black line) and thus is much less robust than neighbor-joining. Notice that 
even when it picks the wrong tree, it can pick the wrong wrong tree — that 
is, the one least supported by the data. It is less robust than neighbor- 
joining in the "Felsenstein zone" [31], which corresponds to the region close 
to the center, where the inner edge is very short. 

Simulations (see Figure [7]) show that building trees by evaluating this 
quadratic invariant does not perform nearly as well as neighbor-joining. 
This is because many simulations with a short interior branch tend to 
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Fig. 6. The selection criteria for neighbor- joining (left) and the four-point invari- 
ant (right) projected to two dimensions. The colored/shaded regions show which dis- 
similarity maps are matched to which trees by the two algorithms. The white/unshaded 
area corresponds to tree (01 : 23), the red/solid area to tree (02 : 13) and the blue/striped 
area to (03 ; 12). 




Neighbor joining Four-point invariant 

Fig. 7. Illustration of FiaureWlon simulated data. Simulated alignments from tree 
(01 ; 23) of length 100 were created for randomly chosen branch lengths between 0.01 
and 0.75. Distances were estimated using the Jukes-Cantor model and projected onto 
two dimensions in the same way as in Figure [3 Trees were built from the distances 
using both neighbor- joining and the four-point invariant. Black circles correspond to 
distances assigned tree (01 ; 23), red x's to tree (02 : 13), and blue diamonds to tree 
(03 : 12). 



return metrics that seem to come from trees with negative inner branch 
lengths. 

This seems to be a large blow to the method of invariants: even the 
most powerful invariant on our list in Section [4] doesn't behave as well as 
this simple condition. However, it can be easily seen that testing the in- 
equality is equivalent to testing the signs of the invariant instead of the 
absolute value, which leads us to ask if invariants can provide a way to dis- 
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cover conditions similar to that used in neighbor-joining (see Problem l8.7p . 
The original paper of Cavender-Felsenstein [Hj also suggested using in- 
equalities, although no one seems to have followed up on this idea. 

8. Open problems. Problem 8.1. Can algebraic ideas he used to 
estimate branch lengths and other parameters in phylogenetic trees? See 
14 S[ for algebraic techniques for estimating parameters in invariable- site 
phylogenetic models. 

Problem 8.2. Investigate the behavior of individual invariants on 
data from trees with heterogeneous rates. Are the best invariants the same 
ones that are powerful for homogeneous rates ? 

Problem 8.3. Can asymptotic statistical methods be practically used 
to normalize invariants? Do they give any information about the power of 
individual invariants? 

Problem 8.4. Do the metrics constructed by the machine learning 
algorithm in Section [5| shed any light on the criteria for invariants to be 
powerful? 

Problem 8.5. Define the "determinantal closure" of an ideal I and 
develop algorithms to calculate it. See also 

Problem 8.6. For group-based models, does Fourier analysis pro- 
vide a method to efficiently evaluate polynomials in the Fourier coordinates 
without destroying the sparsity of the problem? Note that many of the in- 
variants are determinental in Fourier coordinates. 

Problem 8.7. Are there other phylogenetic invariants (say for quartet 
trees under the Jukes-Cantor model) "similar" to the four-point invariant? 
We suggest the following conditions: 

1. Be fixed (up to sign) under the Z2 x Z2 x Z2 symmetries of the 
quartet tree. 

2. Have the following sign condition: if{p) > for all p from T2 and 
T3 (with perhaps a different choice of sign for T2 and T3). See for 
example, the symmetries of the left suhfigure in Figure [^J 

Beware that results such as |^ on the uniqueness of the neighbor-joining 
criterion place some constraints on whether we can hope to find invariants 
mimicking this behavior. 
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